Abstract: Automated computer-aided detection (CADe) in
medical imaging has been an important tool in clinical practice
and research. State-of-the-art methods often show high sensitivities
but at the cost of high false-positive (FP) per-patient
rates. We design a two-tiered coarse-to-fine cascade framework 
that first operates a candidate generation system at sensitivities 
of ~100% but at high FP levels. By leveraging existing CAD 
systems, coordinates of regions or volumes of interest (ROI or 
VOI) for lesion candidates are generated in this step and function 
as input for a second tier, which is our focus in this study. In this 
second stage, we generate N 2D (two-dimensional) or 2.5D views 
via sampling through scale transformations, random translations 
and rotations with respect to each ROI’s centroid coordinates. 
These random views are used to train deep convolutional neural 
network (ConvNet) classifiers. In testing, the trained ConvNets 
are employed to assign class (e.g., lesion, pathology) probabilities 
for a new set of N random views that are then averaged at each 
ROI to compute a final per-candidate classification probability. 
This second tier behaves as a highly selective process to reject 
difficult false positives while preserving high sensitivities. The 
methods are evaluated on three different data sets with different 
numbers of patients: 59 patients for sclerotic metastases detection,
176 patients for lymph node detection, and 1,186 patients
for colonic polyp detection. Experimental results show the ability 
of ConvNets to generalize well to different medical imaging 
CADe applications and scale elegantly to various data sets. Our 
proposed methods improve CADe performance markedly in all 
cases. CADe sensitivities improved from 57% to 70%, from 43% 
to 77% and from 58% to 75% at 3 FPs per patient for sclerotic 
metastases, lymph nodes and colonic polyps, respectively. 


I. Introduction

Accurate computer-aided detection (CADe) plays a
central role in radiological diagnoses. The early detection 
of abnormal anatomies or precursors of pathology associated 
with cancer can aid in preventing the disease, which is among 
the leading causes of death worldwide Q . Furthermore, detec¬ 
tion can help to assess the staging of a patient’s disease, and 
thus has the potential to alter a patient’s required treatment regimen
[2]. Computed tomography (CT), a ubiquitous screening
and staging modality employed for disease detection in cancer 
patients, is commonly used for the detection of abnormal 
anatomy such as tumors and their metastases. At present, the 
detection of an abnormal anatomy via CT often occurs during 
manual prospective visual inspection of every image slice (of 
which there may be thousands) and every section of every
image in each patient’s CT study. This is a complex process 
that, when performed under a time restriction, is prone to 
error. Thorough manual assessment and processing is time- 
consuming and often delays the clinical workflow. Therefore 
CADe has the potential to greatly reduce the radiologists’ 
clinical workload and to serve as a first or second reader for 
improved assessment of the disease 0, El, 0. 

CADe has been an active research area in medical imaging 
for the last two decades. Most work is based on some type of 
image feature extractor that is computed in a region-of-interest 
(ROI) in the image, e.g. intensity statistics, histogram of 
oriented gradients (HoG) 0, scale-invariant feature transform 
(SIFT) 0, Hessian based shape descriptors (such as blobness) 
ID, etc. These features are then used to learn a binary or 
discrete classifier, commonly linear support vector machines 
(SVM) and random forests, to differentiate normal from abnormal
anatomy. At present, examples of CADe used in clinical
practice include polyp detection for colon cancer screening 
(D, Col, lung nodule detection for lung cancer screening 
CD, C3 or breast cancer screening with mammography ca. 
However, many applications of CADe result in significantly 
low sensitivity and/or specificity levels (i.e. high numbers of 
false negatives or false positives per volume). For this reason, 
they have not yet been incorporated into clinical practice. 

The method presented here aims to build upon existing 
CADe systems by forming a hierarchical two-tiered CADe 
system, designed to improve overall detection performance 
(i.e., high recalls together with low, or manageable FP rates per 
patient). To this end, we propose a new representation that efficiently
integrates recent advances in computer vision, namely
deep convolutional neural networks 03, CD (ConvNets, see
Fig. 1).

Recently, the availability of large amounts of annotated 
training sets and the accessibility of affordable parallel computing
resources via Graphics Processing Units (or GPUs)
have made it feasible to train deep convolutional neural 
networks (ConvNets). ConvNets have popularized the topic of 
“deep learning” in computer vision research CD. The usage 
of ConvNets has allowed for substantial advancements not 
only in the classification of natural images (H), but also in 
biomedical applications, such as mitosis detection in digital 
pathology CD, CD. Additionally, recent work has shown
how the implementation of ConvNets can substantially improve
the performance of state-of-the-art CADe systems CD,
EqI, ED, EU. For instance, HU proposes an MRI-based
knee cartilage segmentation using a triplanar ConvNet. 1^ 
describes a supervised 3D boundary detection in volumetric 
electron microscopy (EM) images via ConvNets. 

In this study, we apply ConvNets along with random sets 
of 2D or 2.5D sampled views or observations. Our work 
partly draws upon the idea of hybrid systems, which use 
both parametric and non-parametric models for hierarchical 
coarse-to-fine classification [24]. The non-parametric model is
replaced with aggregating decisions via ConvNets performed 
on random views. 

Our contributions are the following: 

1) We propose a universal 2.5D image decomposition representation
for utilizing ConvNets in CADe problems, which
can be generalized to others (with randomly sampled views
or views sampled under some problem-specific constraints, e.g.,
using local vessel orientations); 2) we propose a new random
aggregation method based on the deep ConvNet classification
approach; 3) we validate on three different data sets with
different numbers of patients and CADe applications;
and 4) we markedly improve performance in all three cases.
In particular, we improve CADe sensitivities from 57% to 
70%, from 43% to 77% and from 58% to 75% at 3 FPs per
patient for sclerotic metastases [4], lymph nodes [25], [26]
and colonic polyps Ea, ca, respectively. This paper extends
our preliminary work on lymph node [20] and sclerotic bone
metastasis detection ED and includes performance evaluation
on a new data set for detecting 252 colonic polyps in 1,186 
patients. We show how ConvNets can be applied to build more 
accurate classifiers for CADe systems, as an effective false 
positive pruning process while maintaining high sensitivity 
recalls. 


II. Methods 

Here, we describe our methods in detail. First, deep convolutional
networks (ConvNets) are introduced, then we describe
how to apply ConvNets to CADe applications in a 2D or 2.5D
approach and how to utilize random ConvNet observations in
the fashion of a decompositional representation. Lastly, we
describe various ways of candidate generation (CG) that are
applicable when using ConvNets on different data sets.

A. Convolutional Neural Networks 

ConvNets are named for their convolutional filters that are 
used to compute image features for classification (see Fig. 2).
In this work, we use two cascaded layers of convolutional 
filters. All convolutional filter kernel elements are trained 
from the data in a supervised fashion by learning from a 
labeled set of examples. This has major advantages over more 
traditional CADe approaches that use hand-crafted features, 
designed from human experience. ConvNets have a better 
chance of capturing the “essence” of the imaging data set 
used for training than do hand-crafted features (161, (6l. (Til. 
M. Furthermore, we can train similarly configured ConvNet 
architectures from randomly initialized or pre-trained model 
parameters for detecting different lesions or pathologies (with 
heterogeneous appearances), with no manual intervention of
system and feature design. Examples of trained filters of the
first convolutional layer and their responses are shown in Fig. 3.
In-between convolutional layers, the ConvNet performs
max-pooling operations in order to summarize feature responses
across neighboring pixels (see Fig. 1). Such operations
allow the ConvNet to learn features that are spatially invariant
with respect to the location of objects in the images. Feature
responses after the second convolutional layer feed into two
locally connected layers (similar to a convolutional layer
but without weight sharing), and then fully-connected neural
network layers for classification. The deeper the convolutional
layers in a ConvNet, the higher the order of image features
they encode. This neural network learns how to interpret the
feature responses and performs classifications. Our ConvNet
uses a final softmax layer which provides a classification
probability for each input image (see Fig. 1). In order to
avoid overfitting, the fully-connected layers are constrained
using the “DropConnect” method [28]. DropConnect behaves
as a regularizer when training the ConvNet by preventing co-adaptation
of units in the neural network. It is a variation
of the previously suggested “DropOut” method 1^ , [30].
We use and modify an open-source implementation (cuda-convnet¹)
by Krizhevsky et al. 03, Ell which efficiently
trains the ConvNet by using GPU acceleration with the
DropConnect modification by 1^ . Additional speed-ups are
achieved by using rectified linear units as neuron activation
functions, as opposed to the functions f(x) = tanh(x) or
f(x) = (1 + e^{-x})^{-1} from traditional neuron models, in the
training and evaluation phases ifT^ . The input image can be
cropped in order to train on translations of the cropped input
image for data augmentation ifTHl . Our ConvNets are trained
using stochastic gradient descent with momentum for 700-300-100-100
epochs on mini-batches of 64-64-32-16 images,
similar to 1^ on the CIFAR-10 data set (using an initial
learning rate of 0.001 with the default weight decay). The
per-pixel mean of the training image set is subtracted from
each image fed to the ConvNet.

¹ https://code.google.com/p/cuda-convnet

Fig. 1. ConvNet applied to a 2.5D volume of interest extracted from a CT
image. The number of convolutional filters, kernel sizes, and neural network
connections for each layer are as shown. We use overlapping kernels with
stride 2 during max-pooling.

Fig. 2. Features are computed by convolving filter kernels over the input
region of interest. The input image can be padded to produce convolution
responses of the same size as the input image.

Fig. 3. Some examples of filter responses (right) after convolution with
trained ConvNet kernels (middle) of the first layer, showing an example of a
sclerotic bone lesion in CT (left).
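
As an illustration of the regularization described above, the following is a minimal NumPy sketch of a training-time forward pass through one fully-connected layer with DropConnect, followed by a softmax. It is not the cuda-convnet implementation used in this work: the function names, the layer sizes and the `keep_prob` value are assumptions chosen for illustration, and the inference-time behavior of DropConnect is not shown.

```python
import numpy as np


def relu(x):
    """Rectified linear unit, f(x) = max(x, 0)."""
    return np.maximum(x, 0.0)


def dropconnect_forward(x, weights, bias, keep_prob, rng):
    """Training-time forward pass of one fully-connected layer.

    DropConnect randomly zeroes individual weights (not whole unit
    outputs, as DropOut does) with probability 1 - keep_prob.
    """
    mask = rng.random(weights.shape) < keep_prob
    return relu((weights * mask) @ x + bias)


def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


# Hypothetical sizes: 8 input features, 4 hidden units, 2 classes.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((4, 8))
b = np.zeros(4)

hidden = dropconnect_forward(x, W, b, keep_prob=0.5, rng=rng)
probs = softmax(rng.standard_normal((2, 4)) @ hidden)
```

Setting `keep_prob=1.0` keeps every weight, so the layer reduces to an ordinary ReLU layer.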

B. Applying ConvNets to CADe - a 2D or 2.5D Approach 

Depending on the imaging data, we explore a two-dimensional
(2D) or two-and-a-half-dimensional (2.5D) representation
to compute a ConvNet observation, sampled at
each CADe candidate location (see Fig. 4). In 2D, we refer
to extracting a Region-of-Interest (ROI). In 2.5D, we refer to
extracting a Volume-of-Interest (VOI). CADe candidate locations
are normally obtained by a candidate generation process,
which requires very high (i.e., close to 100%) sensitivity at
high false positives per patient or volume (about 40 to 60 FPs for
our lymph node or bone lesion data sets and about 150 FPs in
colonic polyp cases). This performance standard can be easily
attained by existing work El, Ea, ES, E3.



Fig. 4. CADe locations can be either observed as 2D image patches or using 
a 2.5D approach, that samples the image using three orthogonal views. Here, 
a lymph node in CT is shown as the input to our method. 
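
The 2.5D sampling in Fig. 4 can be sketched as follows. This is a minimal NumPy illustration, not the extraction code used in this work: it assumes an isotropically resampled volume indexed as (z, y, x), a candidate far enough from the volume border, and a hypothetical helper name `extract_25d_patch`. The three orthogonal slices through the candidate centroid are stacked as the three channels of one image.

```python
import numpy as np


def extract_25d_patch(volume, center, half_size):
    """Stack axial, coronal and sagittal slices through `center`.

    Returns an array of shape (2 * half_size, 2 * half_size, 3),
    one channel per orthogonal view.
    """
    z, y, x = center
    h = half_size
    axial = volume[z, y - h:y + h, x - h:x + h]
    coronal = volume[z - h:z + h, y, x - h:x + h]
    sagittal = volume[z - h:z + h, y - h:y + h, x]
    return np.stack([axial, coronal, sagittal], axis=-1)


# Synthetic 64^3 volume standing in for a resampled CT scan.
volume = np.arange(64 ** 3, dtype=np.float32).reshape(64, 64, 64)
patch = extract_25d_patch(volume, center=(32, 32, 32), half_size=16)
print(patch.shape)
```

All three channels share the voxel at the candidate centroid, which is why the composite is often displayed as an RGB image.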


C. Random ConvNet Observations 

In order to increase the variation of the training data and to
avoid overfitting, analogous to the data augmentation approach
in flAll . ifTTl and (TSl, multiple 2D or 2.5D observations
per ROI or VOI are needed, respectively. Each ROI/VOI can
be translated along a random vector v in the CT space Nt
times. Furthermore, each translated ROI is rotated around
its center Nr times by a random angle α ∈ [0°, ..., 360°].
These translations and rotations for each ROI are computed
Ns times at different physical scales s (the edge length of each
ROI²), but with fixed numbers of pixels by resampling (i.e., the
physical pixel size in millimeters varies with
s). This procedure results in N = Ns × Nt × Nr
random observations of each ROI, an approach similar to
1^ . Only 2D reformatting and sampling within
an axial CT slice (axial reconstruction is the most common
CT reconstruction imaging protocol) is employed when the
inter-slice distances or slice thicknesses are 5 mm or more.

² Without loss of generality, the sampled 2D or 2.5D image patches or
observations have a square shape.
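
The sampling scheme above can be sketched by enumerating the transformation parameters for one candidate. This is an illustrative sketch only: the function name and the dictionary layout are hypothetical, the translation is drawn as a random direction with a magnitude of at most `max_shift_mm`, and the example values (four scales of 30 to 45 mm, shifts up to 3 mm) are those reported in Sec. III-A.

```python
import numpy as np


def sample_view_parameters(scales_mm, n_translations, n_rotations,
                           max_shift_mm, rng):
    """Enumerate Ns x Nt x Nr random view parameters for one ROI."""
    views = []
    for scale in scales_mm:
        for _ in range(n_translations):
            # Random unit direction v scaled to a length <= max_shift_mm.
            direction = rng.standard_normal(3)
            direction /= np.linalg.norm(direction)
            shift = direction * rng.uniform(0.0, max_shift_mm)
            for _ in range(n_rotations):
                angle = rng.uniform(0.0, 360.0)
                views.append({"scale_mm": scale,
                              "shift_mm": shift,
                              "angle_deg": angle})
    return views


rng = np.random.default_rng(0)
views = sample_view_parameters([30, 35, 40, 45], n_translations=5,
                               n_rotations=5, max_shift_mm=3.0, rng=rng)
print(len(views))
```

Each parameter set would then drive one resampling of the CT data into a fixed-size patch; the resampling itself is omitted here.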
Following this procedure, both the training and test data sets
can be expanded to larger scales, which will enhance the neural
net’s generality and trainability. A ConvNet’s predictions on
these N random observations {P_1(x), ..., P_N(x)} can then be
simply averaged³ at each ROI to compute a per-candidate
probability:


$$p(x \mid \{P_1(x), \ldots, P_N(x)\}) = \frac{1}{N} \sum_{i=1}^{N} P_i(x) \qquad (1)$$




Here, P_i(x) is the ConvNet’s classification probability computed
for one individual 2D or 2.5D image patch. In theory,
more sophisticated fusion rules can be explored, but simple
averaging has proven to be effective for this experiment
1^ . Furthermore, this random resampling method simply and
effectively increases the amount of available training data.
In computer vision, translational shifting and mirroring of
2D image patches are often used for this purpose ifT^ . By
averaging the N predictions on random 2D or 2.5D views as
in Eq. (1), the robustness and stability of the ConvNet can be further
increased in testing, as shown in Sec. III.

Fig. 5. Image patches are generated from CADe candidates using different
scales, 2D/3D translations (along a random vector v) and rotations (by a
random angle α) in the axial/3D plane (the example shows a sclerotic bone
lesion in CT). Panels: scales; 3D translations along v; 3D rotations by α.
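
The aggregation in Eq. (1) amounts to a mean over the view axis. A minimal NumPy sketch, assuming the per-view ConvNet probabilities have already been collected into an array with one row per candidate (the function name and the example values are hypothetical):

```python
import numpy as np


def aggregate_views(view_probs):
    """Average N per-view probabilities into one score per candidate.

    `view_probs` has shape (n_candidates, N); the result has shape
    (n_candidates,), matching p(x) in Eq. (1).
    """
    view_probs = np.asarray(view_probs, dtype=float)
    return view_probs.mean(axis=-1)


# Two candidates, N = 4 random views each.
probs = np.array([[0.9, 0.8, 0.7, 0.6],
                  [0.1, 0.2, 0.0, 0.1]])
per_candidate = aggregate_views(probs)
print(per_candidate)
```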


D. Candidate Generation 

In general, any CADe system with a reasonably high sensitivity
level (e.g., ~95%) at an acceptable FP rate (e.g., <150
per patient) can be used as a candidate location generation step
in our proposed framework. Based on a reference data set, such
a candidate can then be labeled as a ‘positive’ or ‘negative’
example and used to train a ConvNet. In this paper, we propose
to apply the ConvNet as a second, more accurate classifier.
This is a coarse-to-fine classification approach loosely inspired
by other CADe schemes such as that presented in [24], although
our methods are significantly different.
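
The two-tier idea can be sketched as a filter over first-tier candidates. This is a schematic sketch under stated assumptions, not the pipeline used in this work: `second_tier_filter`, the `Candidate` type and the toy scoring function are hypothetical, and in practice the scorer would be the ConvNet probability averaged over random views.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical candidate representation: (x, y, z) centroid in mm.
Candidate = Tuple[float, float, float]


def second_tier_filter(
    candidates: Sequence[Candidate],
    score_candidate: Callable[[Candidate], float],
    threshold: float,
) -> List[Tuple[Candidate, float]]:
    """Keep only the first-tier candidates the second tier accepts."""
    kept = []
    for candidate in candidates:
        probability = score_candidate(candidate)
        if probability >= threshold:
            kept.append((candidate, probability))
    return kept


def toy_score(candidate: Candidate) -> float:
    """Stand-in for the second-tier classifier (illustration only)."""
    return 0.9 if candidate[0] < 10.0 else 0.2


first_tier = [(5.0, 1.0, 2.0), (50.0, 3.0, 4.0), (7.5, 8.0, 9.0)]
detections = second_tier_filter(first_tier, toy_score, threshold=0.5)
print(detections)
```

Lowering `threshold` trades more false positives for higher sensitivity, which is the trade-off an FROC curve traces out.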

In this study, we use three existing CADe systems that have 
previously been described in the literature: 

a) Detection of sclerotic spine metastases: we use a
recent CADe method for detecting sclerotic metastases candidates
from CT volumes [4], 1^ (see Sec. III-D). The spine
is initially segmented by thresholding at certain CT attenuation
levels and performing region growing. Furthermore, morphological
operations are used to refine the segmentation and
allow the extraction of the spinal canal. Further information
on spinal canal segmentation and partitioning is provided in
1^ . Axial 2D cross sections of the vertebrae are then divided
into sub-segments by a watershed algorithm based on local
density differences oa. The CADe algorithm then finds initial
detections that have higher mean attenuation levels than
their neighboring 2D sub-segments. Since the watershed algorithm
may over-segment the image, similar 2D sub-segment
detections are merged by performing an energy minimization
based on graph-cut and attenuation thresholds. Then, 2D
detections on neighboring cross sections are combined to
form 3D detections with a graph-cut based merger. Each 3D
detection acts as a seed point for a level-set segmentation
method that segments the lesions in 3D. This step allows us to
compute 25 characteristic features, such as shape, size, location,
attenuation, volume, and sphericity. Finally, a committee
of SVMs is trained on these features.

³ We empirically evaluated several aggregation schemes for computing the
final candidate class probability from a collection of ConvNet observations.
Simple averaging performs the best and has good efficiency.

b) Detection of lymph nodes: we employ two preliminary
CADe systems for detecting lymph node candidates from
mediastinal and abdominal [25] body regions (see Sec.
III-E), respectively. In the mediastinum, lungs are segmented
automatically and shape features are computed at the voxel
level. The system uses a spatial prior of anatomical structures
(such as the esophagus, aortic arch, and/or heart) via multi-atlas
label fusion before detecting lymph node candidates using
an SVM for classification. In the abdomen, a random forest
classifier is used to create voxel-level lymph node predictions
via image features. Both systems permit the combination of
multiple statistical image descriptors (such as Hessian blobness
and HoG) and appropriate feature selection in order to
improve lymph node detection beyond traditional enhancement
filters. Currently, 94%-97% sensitivity levels at rates of 25-35
FP/vol. can be achieved ([26], EH). With sufficient training
in the lymph node candidate generation step, close to 100%
sensitivities could be reached in the future.

c) Detection of colonic polyps: we apply a candidate
generation step using the CADe system presented in Ell
(see Sec. III-F). In this system, the colonic wall and lumen
are first segmented, and any tagged colonic fluids are
removed from CT colonography (CTC) volumes. In order
to identify colonic polyps, we analyze local shape features
(e.g. mean curvature, sphericity, etc.) of the colon’s surface for
the generation of CADe candidates EH. Even though [27]
is a relatively straightforward approach for polyp detection
compared to more recent data-driven colonic polyp CADe
systems in the literature lEa, [38], it can serve as a sufficiently
good candidate generation procedure when coupled with our
random views of ConvNet observations and aggregation for
effective false positive rejection.

E. Cascaded CADe Architectures for False Positive Reduction 

There are two types of cascaded CADe classification
architectures for false positive reduction: 1)
extraction of new image features followed by retraining of
a classifier on all candidates [39], lEHl, 0, [20], [40] (from
Sec. II-D) or 2) design of application-dependent post-filtering
components ED, [42], [43]. Different (often more computationally
expensive) image features are calculated per extracted
candidate, in order to reveal new information omitted from
the CG step, since explicit brute-force search in CG is no
longer necessary. Examples of heterogeneous CADe post-filters
include the removal of 3D flexible tubes ED, ileo-cecal
valve Ea and extra-colonic findings |[4^ in CT colonography.
Although training cascaded CADe systems using the same set
of image features and the same type of classifier (e.g., SVM
or random forest) is feasible, this approach often demonstrates
less effective overall performance (as discussed later) and is
less commonly employed. In this paper, we mainly exploit the first type of
cascade, which uses deep ConvNet models as new components
of integrated image feature representation and classification.


III. Evaluation and Results 


A. Imaging Data Sets and Implementation 

We evaluate our method on three medical imaging data
sets that illustrate common clinical applications of CADe in
CT imaging: sclerotic metastases in spine imaging, lymph
nodes and colonic polyps in cancer monitoring and screening.
We also show the scalability of ConvNets to different data
set sizes, i.e. 59, 176 (86 abdominal, 90 mediastinal) and
1,186 patients per data set, respectively. Some statistics on
patient population, total/mean (target) lesion numbers, total
true positive (TP) and false positive (FP) candidate numbers,
and mean candidate numbers per case are given in Table I. Note
that one target can have several TP detections. For all imaging
data sets used in this study, the image patches were centered
at each CADe coordinate (the candidate VOI centroid from
pre-existing CADe systems [4], [26], [25], [27]) with 32 × 32
pixels in resolution. All patches were sampled at 4 scales of
s = [30, 35, 40, 45] mm ROI edge length in physical image
space, after isotropic resampling of the input CT images (see
Fig. 5). These scales cover the average dimensions for all
objects of interest in the imaging data sets used in this study.
Furthermore, all ROIs were randomly translated (up to 3 mm)
and rotated at each scale (thus Ns = 4, Nt = 5 and Nr = 5),
resulting in N = 100 image patches per ROI. Due to the much
larger data set in the colonic polyp case, the parameters were
chosen to be Ns = 4, Nt = 2 and Nr = 5, resulting in
N = 40 image patches per ROI.

The training times for each ConvNet model were approximately
9-12 hours for the lymph node data set, 12-15 hours
for the bone lesions data set, and 37 hours for the larger
colonic polyps data set. All training was performed using
an NVIDIA GeForce GTX TITAN (6 GB on-board memory)
for 1200 optimization epochs with unit Gaussian random
parameter initializations as in [28]. Running N = 100 2D
or 2.5D image patches at each ROI/VOI for classification of
one CT volume took only about 1-5 minutes. Image patch
extraction from one CT volume took around 2 minutes at
each scale. The employed ConvNet architecture is illustrated
in Fig. 1.
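
A short worked calculation makes the sampling parameters concrete. Because every patch has 32 pixels per edge regardless of scale, the physical pixel size grows with the edge length s. The view counts below assume 5 translations and 5 rotations per scale (2 and 5 for the polyp data), which is one factorization consistent with the 4 scales and the stated totals of 100 and 40 patches; the constant and function names are illustrative.

```python
PATCH_PIXELS = 32              # pixels per patch edge
SCALES_MM = [30, 35, 40, 45]   # physical ROI edge lengths s

# Physical size of one pixel at each scale, in millimeters.
pixel_size_mm = {scale: scale / PATCH_PIXELS for scale in SCALES_MM}


def views_per_candidate(n_scales, n_translations, n_rotations):
    """N = Ns x Nt x Nr random observations per ROI."""
    return n_scales * n_translations * n_rotations


n_default = views_per_candidate(len(SCALES_MM), 5, 5)
n_polyps = views_per_candidate(len(SCALES_MM), 2, 5)

for scale, size in pixel_size_mm.items():
    print(f"{scale} mm edge -> {size:.4f} mm per pixel")
print(n_default, n_polyps)
```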


TABLE II
Improvement with ConvNet integration: previous¹ CADe performance
compared to ConvNet² performance at the 3 FPs/patient rate.

Dataset                  Sensitivity¹  Sensitivity²  AUC¹  AUC²
sclerotic lesions        57%           70%           n/a   0.83
lymph nodes              43%           77%           0.76  0.94
colonic polyps (>=6mm)   58%           75%           0.79  0.82
colonic polyps (>=10mm)  92%           98%           0.94  0.99


B. Trained ConvNet Filter Kernels 

The trained filters of the first convolutional layer for all
three imaging data sets used in this study can be seen in Fig. 6.
A mixed set of low and high frequency patterns exists in
the first convolutional layer. The filter kernels “capture” the
essential information that is necessary for each classification
task. These automatically learned filters need no tuning by
hand, and thus have a major advantage over more traditional
CADe approaches ca. In Fig. 6 a), the learned convolutional
filters for sclerotic metastases are one-channel only (encoded
in gray scale and learned from axial CT images); in b) and c), the
convolutional filters for lymph nodes or colonic polyps are
three-channel (encoded in RGB and trained using three orthogonal
CT views per example). Different visual characteristics of
ConvNet filter kernels are discussed in the caption of Fig. 6 as well.


C. 2D, 2.5D and 3D ConvNet Configurations 


In this experiment, we compare the CADe performance
of varying dimensional inputs to our ConvNet architecture:
2D ROIs, the proposed 2.5D VOIs and 3D VOI
stacks. The effect of data augmentation for ConvNet training is
evaluated on the abdominal lymph node data set. An 80%/20%
split of 86 patients is used for training and testing, respectively.
Fig. 14 shows the FROC performance for both training (left)
and testing (right). It can be observed that a pure 2.5D
approach on the original CT data is not sufficient to capture
the variety of lymph nodes in the test set. However, adding
the proposed random observations in both training and testing
(as a form of data augmentation) leads to the best performing
CADe framework at a level of 3 FPs/vol., compared to 2D
and 3D approaches.

In the 3D case, we extract full 32 × 32 × 32 VOI image stacks
as input to our ConvNet. In this case, the amount of training
data is also not enough to learn all parameters of the ConvNet
without data augmentation in order to generalize well to the
testing data. Clear overfitting occurs in testing, highlighting
the advantages of using a 2.5D approach in applications where
training data can be too limited (as in many medical imaging
problems). Yet, adding data augmentation to the training set
improves the performance in 3D markedly, with the trade-off
of adding ~4× more training time in order to achieve
convergence (see Table III), and performs only comparably to
the augmented 2.5D case.
TABLE I
CADe data sets used for evaluation: sclerotic metastases, lymph nodes, colonic polyps.

Dataset            # Patients  # Targets  # TP   # FP     # Mean Targets  # Mean Candidates
sclerotic lesions  59          532        935    3,372    9.0             73.0
lymph nodes        176         983        1,966  6,692    5.6             49.2
colonic polyps     1,186       252        468    174,301  0.2             147.4



Fig. 6. The first layer of 64 learned convolutional kernels of a ConvNet trained on medical CT images on each of three different CT imaging data sets: a) 
sclerotic metastases, b) lymph nodes and c) colonic polyps. The color coding in b) and c) illustrates the filter kernels used in each orthogonal view when
using our 2.5D approach. The learned convolutional filters for sclerotic metastases in a) are using one-channel as input only (encoded in gray scale and 
learned from axial CT images). Here, complex higher order gradients, blobness and difference of Gaussian filters dominate. In b,c), the convolutional filters 
for lymph nodes or colonic polyps are three-channels (encoded in RGB and trained using three orthogonal CT views per example). Kernels learned from 
lymph nodes are mostly blobness and gradients of different orientations/channels in b). Colonic polyp kernels in c) are visually more diversified than the 
filters in b), especially with new “pointy” patterns probably resembling polyp intrusions from 3D colonic surfaces or tips. 


TABLE III
Training times until convergence in the 2D vs. 2.5D vs. 3D
cases on the abdominal lymph node data set.

Input Dimensions  Augmentation  Time (min)
2D                no            123
2D                yes           847
2.5D              no            59
2.5D              yes           476
3D                no            119
3D                yes           1844
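
The relative training costs implied by Table III can be checked with a few lines of arithmetic. The dictionary below simply transcribes the table; the variable and function names are illustrative.

```python
# Training time in minutes, keyed by (input dimensions, augmentation).
training_minutes = {
    ("2D", "no"): 123, ("2D", "yes"): 847,
    ("2.5D", "no"): 59, ("2.5D", "yes"): 476,
    ("3D", "no"): 119, ("3D", "yes"): 1844,
}


def augmentation_slowdown(dims):
    """Factor by which augmentation lengthens training for one input type."""
    return training_minutes[(dims, "yes")] / training_minutes[(dims, "no")]


# Cost of augmented 3D training relative to augmented 2.5D training.
ratio_3d_vs_25d = (training_minutes[("3D", "yes")]
                   / training_minutes[("2.5D", "yes")])

for dims in ("2D", "2.5D", "3D"):
    print(f"{dims}: augmentation costs {augmentation_slowdown(dims):.1f}x")
print(f"3D vs. 2.5D (augmented): {ratio_3d_vs_25d:.2f}x")
```

The last ratio is just under 4, in line with the roughly fourfold increase in training time noted for the augmented 3D case.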



D. Detection of Sclerotic Metastases 

In our evaluation, radiologists labeled a total of 532 sclerotic 
metastases in CT images of 49 patients (14 female, 35 male 
patients; mean age 57.0 years; age range of 12-77 years). 
A lesion is only labeled if its volume is greater than 300
mm³. These CT scans have reconstruction slice thicknesses
ranging between 2.5 mm and 5 mm. Furthermore, we include 
10 control cases (4 female, 6 male patients; mean age 55.2 
years; age range of 19-70 years) without any spinal lesions. 


Note that 2.5-5 mm thick-sliced CT volumes are used for
this study (for low dose CT radiation). Due to this relatively
large slice thickness, our spatial transformations are all drawn
from within the axial plane, i.e. following the 2D approach
introduced in Sec. II-B. Coronal or sagittal image views
have low longitudinal resolution and thus poor
diagnostic quality.


Any false-positive detection from the candidate generation
step on these patients is used as a “negative” candidate
example in training the ConvNet. This strategy would be
considered as “hard negative mining” or “bootstrapping” in the
general computer vision or statistics literature. The maximum
sensitivity of this candidate generation step in testing was
88.9% d. All patients were randomly split into five sets at
the patient level in order to allow a 5-fold cross-validation. We
adjust the sample rates for positive and negative image patches
in order to generate a balanced data set for training (i.e.,
50% positives and 50% negatives). This means all randomly
sampled positives are included in training, but only a subset of
negative random samples is used. Balancing between positive
and negative training populations is generally beneficial for
training ConvNets when optimizing with logistic regression
cost ns, d. For this data set, a 2D approach is used: each
2D image patch was centered at the CADe coordinate with
32 × 32 pixels in resolution. As stated in Sec. III-A, all patches
are sampled at 4 scales of s = [30, 35, 40, 45] mm ROI edge
length in the physical image space, after isotropic resampling
of the CT images (see Fig. 5). In this data set, we use a bone
window level of [-250, 1250] HU. We now apply the trained
ConvNet to classify image patches from the test data sets.
Fig. 7 and Fig. 8 show typical classification probabilities
on two random subsets of positive and negative ROIs in the
test case, respectively.

Fig. 7. Detection of sclerotic metastases: test probabilities of the ConvNet for
being sclerotic metastases on ‘true’ sclerotic metastases candidate examples
(1.0 equals 100% probability of representing a true positive).

Fig. 8. Detection of sclerotic metastases: test probabilities of the ConvNet for
being sclerotic metastases on ‘false’ sclerotic metastases candidate examples
(0.0 equals 100% probability of representing a false positive). Panel titles:
False Positive: p = 0.56, p = 0.23, p = 0.10, p = 0.03, p = 0.31.
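
The patient-level split and the class balancing described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names are hypothetical, patient identifiers are assumed to be integers, and balancing is assumed to mean keeping all positives while drawing an equally sized random subset of negatives without replacement.

```python
import numpy as np


def patient_level_folds(patient_ids, n_folds, rng):
    """Randomly partition unique patients into n_folds disjoint sets."""
    unique = np.array(sorted(set(patient_ids)))
    rng.shuffle(unique)
    return [set(chunk.tolist()) for chunk in np.array_split(unique, n_folds)]


def balance_patches(positives, negatives, rng):
    """Keep all positives and an equally sized random subset of negatives."""
    if len(negatives) < len(positives):
        raise ValueError("expected at least as many negatives as positives")
    keep = rng.choice(len(negatives), size=len(positives), replace=False)
    return list(positives), [negatives[i] for i in keep]


rng = np.random.default_rng(0)
folds = patient_level_folds(list(range(59)), n_folds=5, rng=rng)
pos, neg = balance_patches(["p0", "p1", "p2"],
                           [f"n{i}" for i in range(10)], rng)
print([len(fold) for fold in folds], len(pos), len(neg))
```

Splitting by patient rather than by patch keeps all views of one patient's candidates in the same fold, so no patient contributes to both training and testing.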

Fig. 9. Detection of sclerotic metastases: FROC curves for a 5-fold cross-
validation using varying numbers of N random view ConvNet observations in
testing of 59 patients (49 with sclerotic metastases and 10 normal controls).
AUC values are computed for corresponding ROC curves. Legend: ConvNet with
N = 1 (AUC: 0.823), N = 5 (AUC: 0.830), N = 10 (AUC: 0.834), N = 25
(AUC: 0.834), N = 75 (AUC: 0.834) and N = 100 (AUC: 0.834); maximum input
sensitivity: 0.889.

Fig. 10. Detection of sclerotic metastases: comparison of FROC curves of
the initial bone lesion candidate generation (squares) and the final
classification using N = 100 random view ConvNet observations (lines) for
both training and testing cases. Results are computed using a 5-fold cross-
validation in 59 patients (49 with sclerotic metastases and 10 normal controls).

Averaging the N predictions at each CADe candidate
allows us to compute a per-candidate probability p(x), as
in Eq. 1. Varying thresholds on the probability p(x) are used
to compute Free-Response Receiver Operating Characteristic
(FROC) curves. The FROC curves for varying N are compared in
Fig. 9 and demonstrate that the classification performance
saturates quickly with increasing N. If N < 100, we use a
random subset of observations to compute the average
prediction value. This means that the run-time efficiency of our
second layer detection could be further improved, without a
noticeable loss of performance, by decreasing N. The proposed
method reduces the number of FPs/patient of the existing
sclerotic metastases CADe systems [41] from 4 to 1.2, from 7 to 3,
and from 12 to 9.5 at sensitivity rates of 60%, 70%, and 80%,
respectively, in cross-validation testing (at N = 100). The
Area-Under-the-Curve (AUC) values remain stable at 0.834 for
N between 10 and 100.
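
The per-candidate averaging described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `aggregate_views`, its arguments and the use of Python's `random` module are hypothetical, and the per-view probabilities are assumed to have been produced already by the trained ConvNet.

```python
import random


def aggregate_views(view_probs, n=None, rng=None):
    """Average per-view probabilities into one per-candidate probability.

    view_probs: ConvNet probabilities for the random views of one candidate.
    n: if given and smaller than len(view_probs), average a random subset
       of n observations instead of all of them.
    """
    probs = list(view_probs)
    if not probs:
        raise ValueError("need at least one view probability")
    if n is not None:
        if n < 1:
            raise ValueError("n must be at least 1")
        if n < len(probs):
            rng = rng or random.Random()
            probs = rng.sample(probs, n)
    return sum(probs) / len(probs)


if __name__ == "__main__":
    views = [0.91, 0.85, 0.40, 0.88, 0.95]
    print(aggregate_views(views))                            # all views
    print(aggregate_views(views, n=3, rng=random.Random(0)))  # random subset
```

Passing a seeded `random.Random` instance makes the subset selection reproducible, which is convenient when comparing FROC curves for different values of N.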

Fig. 10 compares the FROCs from the initial (first layer)
CADe system and illustrates the progression towards the
proposed coarse-to-fine two-tiered method in both training and
testing data sets. This demonstrates a marked improvement
in performance. The FROC performance differences
between training and testing in both cases still show some
degree of overfitting, which could be addressed by including
more patient data (59 patients are in general too few to train
ConvNets that generalize well). This observation is relevant for
later work on deep learning system design for medical diagnosis.
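
FROC curves such as those compared above can be computed from the per-candidate probabilities with a few lines of code. The sketch below is illustrative only: it assumes each candidate is given as a pair of probability and lesion identifier (`None` for a false positive), and the names `froc_points` and `sensitivity_at` are hypothetical. Lesions missed by candidate generation are accounted for through `n_lesions`, which caps the reachable sensitivity.

```python
def froc_points(candidates, n_lesions, n_patients):
    """Return (FPs per patient, sensitivity) for every probability threshold.

    candidates: list of (probability, lesion_id) pairs pooled over patients;
                lesion_id is None for a false-positive candidate.
    """
    points = []
    for t in sorted({p for p, _ in candidates}, reverse=True):
        kept = [lesion for p, lesion in candidates if p >= t]
        detected = {lesion for lesion in kept if lesion is not None}
        false_pos = sum(1 for lesion in kept if lesion is None)
        points.append((false_pos / n_patients, len(detected) / n_lesions))
    return points


def sensitivity_at(points, fp_rate):
    """Highest sensitivity reachable at or below fp_rate FPs per patient."""
    reachable = [sens for fps, sens in points if fps <= fp_rate]
    return max(reachable) if reachable else 0.0


if __name__ == "__main__":
    cands = [(0.9, "a"), (0.8, None), (0.7, "b"),
             (0.6, "a"), (0.4, None), (0.2, "c")]
    pts = froc_points(cands, n_lesions=4, n_patients=2)
    print(pts)
    print(sensitivity_at(pts, 0.5))
```

Counting distinct lesion identifiers, rather than true-positive candidates, prevents several candidates on the same lesion from inflating the sensitivity.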


E. Detection of Thoracoabdominal Lymph Nodes 

The next data set consists of 176 patients and is used for
CADe of lymph nodes. Here, the slice thickness of the CT scans
was <1 mm. Hence, we were able to apply a 2.5D approach
(composite of three orthogonal 2D views) for sampling each
CADe candidate, as described in Sec. II-B. Radiologists labeled
a total of 388 mediastinal lymph nodes and 595 abdominal
lymph nodes as ‘positives’ in the CT images. In order to
objectively evaluate the performance of our ConvNet-based
2.5D detection approach, 100% sensitivity at the lymph node
candidate generation stage is assumed for training by injecting
the labeled lymph nodes into the set of CADe lymph node
candidates (see Sec. II-D). The CADe system produces a total
of 6,692 false-positive detections (>15 mm away from a true
lymph node) in the mediastinum and the abdomen. These
false-positive detections are used as ‘negative’ lymph node
candidate examples for training the ConvNets. There are a total
of 1,956 true-positive detections from the existing CADe systems.
All patients are randomly split into three subsets (at the patient
level) to allow a 3-fold cross-validation. We use different sampling
rates for positive and negative image patches to generate a
balanced training set, which proves beneficial for training the
ConvNet. Each three-channel image patch (a 2.5D view) is
centered at a CADe coordinate and has 32 × 32 pixels. Again,
all patches are sampled at 4 scales, s = [30, 35, 40, 45] mm
for the VOI edge length in the physical image space, after
isotropic resampling of the CT images. We use
a soft-tissue window level of [-100, 200] HU, as in [44].

Fig. 11. Detection of lymph nodes: test probabilities of the ConvNet for
being a lymph node on ‘true’ (top box) and ‘false’ (bottom box) lymph node
candidate examples.

Fig. 12. Detection of lymph nodes: FROC curves for a 3-fold cross-validation
using a varying number of N random view ConvNet observations in 176
patients. AUC values are computed for corresponding ROC curves. The
previous performance is shown for comparison.

Furthermore, all VOIs are randomly translated (up to 3 mm)
and rotated N = 100 times at each scale. After training, we
apply the trained ConvNet to classify image patches from the
testing data sets. Fig. 11 shows some typical classification
probabilities on a random subset of test VOIs. Averaging
the N predictions at each lymph node candidate allows us
to compute a per-candidate probability p(x), as in Eq. 1.
Varying a threshold on this probability allows us
to compute the free-response receiver operating characteristic
(FROC) curves. FROC curves for varying N are compared in
Fig. 12. The classification performance saturates quickly with
increasing N, consistent with Sec. III-D. The classification
sensitivity improves on the existing lymph node CADe systems
from 55% to 70% in the mediastinum and from 30% to 83% in
the abdomen at a low rate of 3 FPs per patient volume (FP/vol.),
for N = 100. The AUC improves from 0.76 to 0.942 in the abdomen
when using the proposed false-positive reduction approach
(the AUC for the mediastinal lymph nodes was not available for
comparison). At an operating point of 3 FP/vol., the improvement
is statistically significant, with p < 0.001 in both the
mediastinum and the abdomen (Fisher’s exact test).
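
For readers who wish to reproduce this kind of significance test at a fixed operating point, a two-sided Fisher's exact test on a 2 × 2 table of detected and missed lesions can be sketched with the standard library alone. The contingency counts are not reported in the text; the counts in the usage example are hypothetical, obtained by applying the 30% and 83% abdominal sensitivities to the 595 labeled abdominal lymph nodes, and the function name is likewise an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x counts in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Hypothetical counts: rows = method, columns = (detected, missed).
    baseline = (179, 416)   # about 30% of 595 lesions detected
    convnet = (494, 101)    # about 83% of 595 lesions detected
    print(fisher_exact_two_sided(*baseline, *convnet) < 0.001)
```

The small relative tolerance in the comparison guards against floating-point ties between tables that are equally likely as the observed one.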

Further experiments show that training a joint ConvNet
model on both mediastinal and abdominal lymph node
candidates can improve the classification by ~10%
to ~80% in sensitivity (case by case) at 3 FP/vol.
in the mediastinal set. The overall 70% sensitivity at 3 FP/vol.
increases to 77% in the mediastinum. The sensitivity level in
the abdominal data set remains stable. We achieve a substantial
improvement compared to the state-of-the-art methods in
lymph node detection. The work in [45] reports a 52.9%
sensitivity rate at 3.1 FP/vol. in the mediastinum, while we
achieve a rate of 70% [20] or 77% (joint training) at 3 FP/vol.
In the abdomen, the most recent work [46] shows a 70.5%
sensitivity rate at 13.0 FP/vol. We obtain 83% at 3 FP/vol.
(assuming ~100% sensitivity at the lymph node candidate
generation stage). Note that a direct comparison with other
recent work is difficult since common data sets were not
previously used. Therefore, our data set and supporting material
have been made publicly available for future comparison purposes.


F. 2.5D ConvNets Compared to Shallow Classification

We compare our 2.5D approach to other means of second
tier classification (FP filter or “killer”), e.g., a linear SVM
based on Histogram of Oriented Gradients (HoG) features
as proposed in [6]. There, both simple pooling and sparse
linear decision fusion schemes to aggregate 2D detection
scores are exploited for the final 3D lymph node detection.
This type of cascade classification is similar in spirit to our
presented second tier deep classifier (ConvNet), but uses
state-of-the-art shallow classifiers (libSVM and sparse linear
fusion via the Relevance Vector Machine [48]). As shown
in Fig. 13, a clear advantage of using the proposed 2.5D
ConvNet method can be observed (unlike in [6]). Note that
this shallow linear cascade approach via new image features,
such as Histogram of Oriented Gradients, already significantly
surpasses previous state-of-the-art methods.
Furthermore, we also test the same set of image features and
random forest classifiers in a two-tiered cascade of hierarchy.
No improvement in CADe performance is observed.
This highlights the importance of leveraging heterogeneous
image features in the two stages of candidate generation and
candidate classification.

http://www.cc.nih.gov/about/SeniorStaff/ronald_summers.html
http://dx.doi.org/10.7937/K9/TCIA.2015.AQIIDCNM
www.holgerroth.com



Fig. 13. Comparison of the FROC performance of the previous method, used as
the candidate generation step with a random forest classifier, against an
alternative second level classification approach using histogram of oriented
gradients (HoG) [6] and the proposed 2.5D ConvNet approach using ConvNet
observations on N = 100 random views. Horizontal axis: false positive rate
per patient.


G. 3D, 2D or 2.5D ConvNets: Alleviating Curse-of-dimensionality via Random View Aggregation

Medical images are intrinsically 3D, but relative to other
computer vision problems, CADe problems often lack sufficient
training data to learn 3D models effectively (see Fig. 14).
From the perspective of the ‘curse of dimensionality’,
a 3D task requires at least one order of magnitude more
training data than a 2D task. This problematic data distribution
setting can hamper the performance of learning algorithms in
CADe, thus motivating us to exploit the 2D/2.5D decompositional
sampling and aggregation representation. The number
of training instances is increased up to 100 times
(although the samples are not independent and identically
distributed) for training ConvNets, without directly learning the
complex and explicit 3D object representation and classification.
Likewise, the compositional two-stream 2D ConvNet models run
on separate spatial (RGB) and temporal (i.e., optical flow
field) video frames and achieve a mean accuracy of 87.9%
in an action classification task on the medium-scale data set
UCF-101 [50]. This result significantly outperforms the direct
3D “spatial-temporal” ConvNet method [51] at 65.4% (mean
accuracy), evaluated on the same UCF-101 benchmark.
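
To make the dimensionality argument above concrete, the input sizes of the three representations can be compared directly. The short sketch below is only a worked illustration: it assumes a 3D VOI resampled to 32 × 32 × 32 voxels, which is not stated in the text (only the 32 × 32 patch size is), and input dimensionality is used here as a rough proxy, not as a measure of the training data actually required.

```python
def input_dimensions(size=32):
    """Number of input values per sample for each representation.

    The cubic 3D input of size**3 voxels is an assumption for illustration.
    """
    return {
        "2D": size * size,        # one grayscale patch
        "2.5D": 3 * size * size,  # three orthogonal patches as channels
        "3D": size ** 3,          # full volumetric VOI (assumed cubic)
    }


if __name__ == "__main__":
    dims = input_dimensions()
    for name, n in dims.items():
        print(f"{name}: {n} inputs, {n / dims['2D']:.0f}x the 2D input")
```

Under this assumption the 2.5D input is three times the size of the 2D input, whereas the 3D input is 32 times larger.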

In Fig. 14, we conduct an extensive empirical evaluation
and comparative study using 3D, 2D or 2.5D ConvNets for
lymph node detection.

1) The “ORIG” versions of the 3D, 2D and 2.5D ConvNets
demonstrate consistently better training performance than the
“AUG” setting (i.e., more data in “AUG” makes it harder to
over-fit), as illustrated in Fig. 14, Left. However, in testing,
the 3D, 2D and 2.5D ConvNets trained under data augmentation
(“AUG”) all clearly outperform their “ORIG” counterparts.

2) Without data augmentation, the more complex 3D ConvNet
model shows a large decline in performance from training to
testing compared to the 2D and 2.5D ConvNets, which indicates
stronger over-fitting due to the curse-of-dimensionality
(Fig. 14, Right). In the “ORIG” setting, the 2.5D and 2D ConvNets
give noticeably better testing FROC results (while being
comparable overall between themselves), followed by the 3D
ConvNet. Consequently, this observation validates the concept
that simpler or lower-dimensional learning models generalize
better than complex ones without sufficient available training
data (as in the “ORIG” setting).

3) Data augmentation based on random view aggregation, as
proposed in our original work [20], effectively circumvents the
“curse-of-dimensionality” or “over-fitting” issue in the
data-demanding ConvNet training procedures. This strategy has
been adapted to computer-aided pulmonary embolism detection
[52], lung nodule classification in CT images and polyp
detection in colonoscopy videos [54], [55].

4) The 2.5D and 3D (“AUG”) ConvNets dominate the 2D (“AUG”)
ConvNet over most of the FROC range, while the 2.5D ConvNet
performs best of the three models in the FP range of [2, 4].
Overall, the 2.5D ConvNet performs comparably (in both training
and testing) to the more computationally expensive 3D ConvNet
configuration, which requires augmented 3D volumetric VOI
inputs. In summary, the evaluated 2.5D “AUG” ConvNet
is selected as the best trade-off lymph node detection model
when both detection performance and computational efficiency
are taken into account.


H. Detection of Colonic Polyps 

In CT colonography (CTC), patients are typically scanned
in the prone and supine positions [56], so we obtain two CT
volumes per patient study. We use CTC images from three
institutions in this study. A total of 1,186 patients with prone
and supine CTC images were included (as in [27]). In this
data set, each polyp ≥6 mm found at optical colonoscopy
was located on the prone and supine CTC examinations using
3D endoluminal colon renderings with “fly-through” viewing
and multiplanar reformatted images.

The patients were separated into training (n = 394) and
testing (n = 792) sets with similar age and gender distributions,
an approximate 1:2 split. There were 79 training and 173
testing polyps (≥6 mm), and 22 training and 37 testing
polyps (≥10 mm, considered large polyps), in our CTC
data set. The candidate generation step for colonic polyps
is performed by an existing CADe system. In this
system, the colonic wall and lumen are first segmented, and
any tagged colonic fluids are removed. To identify colonic
polyps, the 3D colon surface is examined with
shape filtering features to generate CADe findings or candidates.

Fig. 14. Comparison of the FROC performance when training a ConvNet with 2D, 2.5D and 3D inputs of the original (“ORIG”) or augmented (“AUG”)
CT data. In the “ORIG” setting, 2D ConvNet shows the best generalized testing FROC result, followed by 3D and 2.5D ConvNets. The 2.5D approach using
aggregation of random observations (“AUG”) in both training (Left) and testing (Right) out-performs both 2D and 3D approaches on the original data at the
3 FPs/patient level. The 2.5D ConvNet trained on augmented data overall performs comparably to a more computationally expensive 3D ConvNet approach
on augmented 3D inputs. In brief, the evaluated 2.5D “AUG” ConvNet is chosen as the best trade-off lymph node detection model between effectiveness and
efficiency. Horizontal axes: false positive rate per patient.

The FROC curves for detecting adenomatous polyps of
≥6 and ≥10 mm, respectively, are shown in Fig. 15 for a
varying number of observations N. The performance saturates
quickly after N = 10 random observations. At both polyp
size thresholds, a large improvement in sensitivity at all
false-positive rates can be observed. In all cases, the sensitivity
levels are higher for larger polyps at constant false-positive
rates. At a rate of 3 FPs per patient for polyps ≥6 mm, the
per-patient sensitivity is raised from 58% using an SVM
classifier (as in [27]) to 75% using our 2.5D ConvNet approach
(see the corresponding table). These results are comparable to
other already highly tuned CADe systems for colonic polyp
detection in CTC, such as [37], [38].

Note that our system achieves significantly higher sensitivities
of 95% and 98% at 1 and 3 FP/vol., respectively, for clinically
actionable ≥10 mm polyps, compared to sensitivities of 82% at
3.65 FP/vol. in [33], and of 76% at 1 FP/vol. and 95% at 4.5
FP/vol. in [38]. The hierarchical voxel labeling CADe approaches
for colonic polyps better handle smaller polyps (≥6 but
<10 mm), at 84.7% sensitivity with less than 3.62 FP/vol., but
exhibit inferior performance on the clinically more important
large polyps. Note that the results of our work
and of previous methods [37], [40], [38] cannot be
strictly compared since different data sets are evaluated. The
scales of the colonic polyp CADe data sets are similar: 770
tagged-prep CT scans from multiple medical sites (358 training
and 412 validation) in [33], [40]; 180 patients (360 CTC volumes)
for training and 202 patients (404 volumes) for testing in [38].

Finally, operating at 1 FP/patient to obtain about 95%
sensitivity in testing (improved from ~65% in [23]) for ≥10
mm large polyp detection is a desirable clinical setting for
employing CADe in second reader mode, with a minimal
extra burden for radiologists. In [38], approximately four times
more effort is needed to review FPs (i.e., retaining 95%
sensitivity at 4.5 FP/vol.).


I. Limitation & Improvement 

Although consistent FROC improvements are observed in
Fig. 15 for both polyp categories of ≥6 and ≥10 mm,
our final system demonstrates more appealing performance
for large polyps (i.e., ≥10 mm). Achieving 95% sensitivity
at 1 FP/patient in testing is the best reported quantitative
benchmark, to the best of our knowledge, for a large-scale
colonic polyp CADe system. For polyps between 6 and 9 mm,
our random 2.5D view sampling may not be optimal due to the
smaller object size to detect (a portion of the sampled 2.5D
images may contain only a tiny field-of-view of the target polyp).
Potentially, the performance could be improved by adopting a
local colonic surface alignment, such as [39], to further guide
and constrain our random view sampling procedure.

Fig. 15. Detection of colonic polyps: FROC curves for different polyp sizes,
using up to N = 40 random view ConvNet observations in 792 testing CT
colonography patients. Horizontal axis: false positive rate per patient.

IV. Discussion and Conclusions 

This work (among others) demonstrates that
deep ConvNets can be extended to 2D and 3D medical image
analysis tasks. We demonstrate significant improvements in the
CADe performance for three pathology categories (i.e., bone
lesions, enlarged lymph nodes and colonic polyps) using CT
images. Building upon existing CADe systems, we show that
a random set of ConvNet observations (via both 2D and
2.5D approaches) can be exploited to drastically improve the
sensitivities over various false-positive rates from initial CADe
detections. Sampling at different scales, with random translations
and rotations around each of the CADe detections, can be
employed to prevent or alleviate overfitting during training
and to increase the ConvNet’s classification performance.
Subsequently, the testing FROC curves exhibit marked
improvements in sensitivity levels over the range of clinically
relevant FP/vol. rates in all three evaluated CT imaging data sets.
Furthermore, our results indicate that ConvNets can improve
the state-of-the-art (as in the case of lymph nodes) or are at
least comparable to already highly tuned CADe systems, as in
the case of colonic polyp detection [37], [40], [38].

The main purpose of a 2.5D approach is to decompose
the volumetric information from each VOI into a set of
random 2.5D images (with three channels) that combine the
orthogonal slices at N reformatted orientations in the original
3D imaging space. Our relatively simple re-sampling of the
3D data circumvents the direct use of 3D ConvNets [23].
This not only greatly reduces the computational burden for
training and testing but also, more importantly, alleviates the
curse-of-dimensionality problem. Direct training of 3D deep
ConvNets [23] for a volumetric object detection problem may
currently cause scalability issues when data augmentation is
not feasible or when there is a severe lack of training samples,
as is often the case in the medical imaging domain. ConvNets
generally need large numbers of training examples to address the
overfitting issue, given the large number of model
parameters. Data augmentation can be useful, as shown in
this study, but a trade-off between computational burden and
classification performance needs to be made. A 2.5D approach as
proposed here can be a valid alternative to using 3D inputs.
Random resampling is an effective and efficient way to increase
the amount of available training data in 3D, as in the presented
approach. Other work uses translational shifting and mirroring of
2D image patches for this purpose. Our 2.5D representation
is intuitive and transfers the success of large-scale 2D image
classification using ConvNets effortlessly into 3D space.
The above averaging process (i.e., Eq. 1) further improves
the robustness and stability of 2D/2.5D ConvNet labeling on
random views in validation or testing.
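
The decomposition of a VOI into random 2.5D views can be sketched geometrically as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the rotation is drawn as a random angle about a random axis, the translation is bounded in magnitude by 3 mm, and `intensity_at` stands for any user-supplied interpolator that returns the CT value at a physical position in mm.

```python
import math
import random


def random_unit_vector(rng):
    """Uniformly distributed direction obtained from three Gaussian samples."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-9:
            return [c / norm for c in v]


def random_rotation(rng):
    """Rotation matrix for a random angle about a random axis (Rodrigues)."""
    x, y, z = random_unit_vector(rng)
    a = rng.uniform(0.0, 2.0 * math.pi)
    c, s, t = math.cos(a), math.sin(a), 1.0 - math.cos(a)
    return [
        [t * x * x + c, t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c, t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]


def random_view_grid(center_mm, edge_mm, size=32, max_shift_mm=3.0, rng=None):
    """Sample positions (mm) of three orthogonal size x size planes.

    The planes share one randomly shifted center and one random orientation.
    """
    rng = rng or random.Random()
    rot = random_rotation(rng)
    axes = [[rot[r][k] for r in range(3)] for k in range(3)]  # matrix columns
    direction = random_unit_vector(rng)
    radius = rng.uniform(0.0, max_shift_mm)
    origin = [c + radius * d for c, d in zip(center_mm, direction)]
    step = edge_mm / size
    offsets = [(i + 0.5) * step - edge_mm / 2.0 for i in range(size)]
    planes = []
    for u, v in ((0, 1), (0, 2), (1, 2)):
        planes.append([[[origin[k] + a * axes[u][k] + b * axes[v][k]
                         for k in range(3)] for a in offsets] for b in offsets])
    return planes


def sample_view(intensity_at, planes):
    """Evaluate intensity_at(x, y, z) on each plane: a three-channel patch."""
    return [[[intensity_at(*p) for p in row] for row in grid] for grid in planes]


if __name__ == "__main__":
    grids = random_view_grid((10.0, 20.0, 30.0), 30.0, rng=random.Random(0))
    patch = sample_view(lambda x, y, z: x + y + z, grids)
    print(len(patch), len(patch[0]), len(patch[0][0]))
```

Calling `random_view_grid` N times per scale and passing each sampled patch through the trained classifier would yield the per-view probabilities that are subsequently averaged.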

A secondary advantage of using 2.5D inputs may be that
ConvNets pre-trained on larger databases available
in the computer vision domain (such as ImageNet) could be
used, potentially allowing the ConvNet optimization to start
from an initialization that is better than Gaussian
random parameters.

Potentially, larger and deeper convolutional neural networks
could be applied to further improve classification performance
[59], [60]. However, the curse-of-dimensionality problem
makes it difficult to assess the amount of data
that is needed to effectively train these very deep networks.
Extensions of ConvNets to 3D have been proposed, but
computational cost and memory consumption can still be too high
to implement them efficiently on current graphics
hardware [23].

Finally, the proposed 2D and 2.5D generalization of ConvNets
is promising for various applications in computer-aided
detection in 3D medical images. For example, the 2D views
with the highest probability of containing a lesion could be
used to present “classifier-guided” reformatted visualizations
at that orientation (optimal to the ConvNet) to assist
radiologists’ reading. In summary, we present and validate the
use of 3D VOIs with a new 2D and 2.5D representation that
may easily facilitate a general-purpose 3D object
detection-by-classification scheme.